{ "cells": [ { "cell_type": "markdown", "id": "5fa5cd08", "metadata": {}, "source": [ "# Tutorial 02 - Loading Data from an Unstructured Directory\n", "\n", "In Tutorial 01 we assumed a specific folder structure to load the audio files and create a PyTorch Dataset. This is restrictive, since in most cases a dataset ships as a single folder containing all audio files, and the individual splits are defined by some other structure (e.g., `csv` or `json` files). In this tutorial we demonstrate an alternative, more Pythonic way to load your data and create the audio classification dataset." ] }, { "cell_type": "markdown", "id": "a1fd5904", "metadata": {}, "source": [ "## 1. Dataset Downloading & Inspection\n", "\n", "For this tutorial we use a small version of the SpeechCommands dataset with 12 classes: ten spoken English commands (e.g., \"down\", \"go\", \"left\") plus `_silence_` and `_unknown_`, recorded from various speakers. More information about the dataset can be found in the [HEAR](https://arxiv.org/abs/2203.03022) evaluation benchmark paper." ] }, { "cell_type": "code", "execution_count": 1, "id": "be51b64c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2026-04-17 15:24:25-- https://zenodo.org/records/5887964/files/hear2021-speech_commands-v0.0.2-5h-48000.tar.gz?download=1\n", "Resolving zenodo.org (zenodo.org)... 188.185.43.153, 188.185.48.75, 188.184.103.118, ...\n", "Connecting to zenodo.org (zenodo.org)|188.185.43.153|:443... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: 1430299345 (1.3G) [application/octet-stream]\n", "Saving to: ‘hear2021-speech_commands-v0.0.2-5h-48000.tar.gz?download=1’\n", "\n", "hear2021-speech_com 100%[===================>] 1.33G 6.69MB/s in 4m 15s \n", "\n", "2026-04-17 15:28:40 (5.35 MB/s) - ‘hear2021-speech_commands-v0.0.2-5h-48000.tar.gz?download=1’ saved [1430299345/1430299345]\n", "\n" ] } ], "source": [ "# We download the dataset from Zenodo using wget\n", "\n", "!wget https://zenodo.org/records/5887964/files/hear2021-speech_commands-v0.0.2-5h-48000.tar.gz?download=1" ] }, { "cell_type": "code", "execution_count": null, "id": "1b5753fe", "metadata": {}, "outputs": [], "source": [ "# We extract the downloaded tar.gz file into the /data directory (the directory must already exist)\n", "!tar -zxf ./hear2021-speech_commands-v0.0.2-5h-48000.tar.gz?download=1 -C /data" ] }, { "cell_type": "markdown", "id": "40ddbb1a", "metadata": {}, "source": [ "The dataset is now available at `/data/hear-2021.0.6/tasks/speech_commands-v0.0.2-5h`. The folder contains the following files:\n", "\n", "- `labelvocabulary.csv`: The mapping between class names and integer values.\n", "- `task_metadata.json`: Metadata of the dataset.\n", "- `train.json`: The audio filenames corresponding to the training set.\n", "- `valid.json`: The audio filenames corresponding to the validation set.\n", "- `test.json`: The audio filenames corresponding to the test set.\n", "\n", "The folder `48000` contains three subfolders, `train`, `test`, and `valid`, each holding the audio files of the respective split at a 48 kHz sampling rate." 
] }, { "cell_type": "code", "execution_count": 5, "id": "4633ec79", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'task_name': 'speech_commands',\n", " 'version': 'v0.0.2',\n", " 'embedding_type': 'scene',\n", " 'prediction_type': 'multiclass',\n", " 'split_mode': 'trainvaltest',\n", " 'sample_duration': 1.0,\n", " 'evaluation': ['top1_acc'],\n", " 'download_urls': [{'split': 'train',\n", " 'url': 'http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz',\n", " 'md5': '6b74f3901214cb2c2934e98196829835'},\n", " {'split': 'test',\n", " 'url': 'http://download.tensorflow.org/data/speech_commands_test_set_v0.02.tar.gz',\n", " 'md5': '854c580ee90bff80c516491c84544e32'}],\n", " 'default_mode': '5h',\n", " 'max_task_duration_by_split': {'train': 16000.0,\n", " 'valid': 2000.0,\n", " 'test': None},\n", " 'tmp_dir': '_workdir',\n", " 'mode': '5h',\n", " 'splits': ['train', 'valid', 'test']}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We inspect the contents of the metadata file\n", "import json\n", "from pathlib import Path\n", "\n", "DATA_PATH = Path(\"/data/hear-2021.0.6/tasks/speech_commands-v0.0.2-5h/\")\n", "TRAIN_PATH = DATA_PATH / \"48000\" / \"train\"\n", "TEST_PATH = DATA_PATH / \"48000\" / \"test\"\n", "VALID_PATH = DATA_PATH / \"48000\" / \"valid\"\n", "\n", "with open(DATA_PATH / \"task_metadata.json\", \"r\") as f:\n", "    metadata = json.load(f)\n", "\n", "metadata" ] }, { "cell_type": "markdown", "id": "1eb95e7b", "metadata": {}, "source": [ "The metadata shows that each audio clip is 1 second long (`'sample_duration': 1.0`). Therefore, we will set `segment_duration=1.0` when creating the PyTorch dataset. Below we inspect the format of the JSON split files." 
] }, { "cell_type": "code", "execution_count": 7, "id": "adc51b1c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "_silence__doing_the_dishes-1048000.wav ['_silence_']\n" ] } ], "source": [ "with open(DATA_PATH / \"train.json\", \"r\") as f:\n", "    train_json = json.load(f)\n", "\n", "# Inspect the first entry in the train.json file\n", "key, value = next(iter(train_json.items()))\n", "\n", "print(key, value)" ] }, { "cell_type": "markdown", "id": "5046c2df", "metadata": {}, "source": [ "We see that the JSON maps each filename to a list containing its class label. We parse the JSON files for the validation and test splits in a similar manner." ] }, { "cell_type": "code", "execution_count": 8, "id": "b3f8bdd9", "metadata": {}, "outputs": [], "source": [ "with open(DATA_PATH / \"test.json\", \"r\") as f:\n", "    test_json = json.load(f)\n", "\n", "with open(DATA_PATH / \"valid.json\", \"r\") as f:\n", "    valid_json = json.load(f)" ] }, { "cell_type": "markdown", "id": "118f96a2", "metadata": {}, "source": [ "## 2. Dataset Creation using Python Dictionaries\n" ] }, { "cell_type": "markdown", "id": "9e09d3f2", "metadata": {}, "source": [ "Now that we understand the structure of the dataset, we can easily create the datasets. We first define the `class_mapping` from the provided `labelvocabulary.csv` file." 
] }, { "cell_type": "code", "execution_count": 12, "id": "25f562c0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'_silence_': 0,\n", " '_unknown_': 1,\n", " 'down': 2,\n", " 'go': 3,\n", " 'left': 4,\n", " 'no': 5,\n", " 'off': 6,\n", " 'on': 7,\n", " 'right': 8,\n", " 'stop': 9,\n", " 'up': 10,\n", " 'yes': 11}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import csv\n", "\n", "with open(DATA_PATH / \"labelvocabulary.csv\", \"r\") as f:\n", "    reader = csv.reader(f)\n", "    next(reader)  # Skip the header row\n", "    # Each row stores the integer index first and the class name second\n", "    label_mapping = {row[0]: row[1] for row in reader}\n", "\n", "# Invert the mapping to class name -> integer index\n", "class_mapping = {v: int(k) for k, v in label_mapping.items()}\n", "\n", "class_mapping" ] }, { "cell_type": "markdown", "id": "26d7c3ce", "metadata": {}, "source": [ "To instantiate a PyTorch Dataset for audio classification we use the function `audio_classification_dataset_from_dictionary`. It expects the same arguments as `audio_classification_dataset_from_dir`, except that instead of a directory path we pass a Python dictionary of the form `{\"file_path\": \"class_name\"}` through the `file_to_class_mapping` argument. Conveniently, this information is already contained in the `train_json`, `valid_json`, and `test_json` variables defined previously." 
] }, { "cell_type": "code", "execution_count": 15, "id": "7ff0d938", "metadata": {}, "outputs": [], "source": [ "from deepaudiox import audio_classification_dataset_from_dictionary\n", "\n", "# We only need to prepend the absolute path and index the class label for the dataset\n", "train_json = {str(TRAIN_PATH / key): value[0] for key, value in train_json.items()}\n", "valid_json = {str(VALID_PATH / key): value[0] for key, value in valid_json.items()}\n", "test_json = {str(TEST_PATH / key): value[0] for key, value in test_json.items()}\n", "\n", "train_dset = audio_classification_dataset_from_dictionary(file_to_class_mapping=train_json,\n", " class_mapping=class_mapping,\n", " sample_rate=32000,\n", " segment_duration=1.0)\n", "\n", "valid_dset = audio_classification_dataset_from_dictionary(file_to_class_mapping=valid_json,\n", " class_mapping=class_mapping,\n", " sample_rate=32000,\n", " segment_duration=1.0)\n", "\n", "test_dset = audio_classification_dataset_from_dictionary(file_to_class_mapping=test_json,\n", " class_mapping=class_mapping,\n", " sample_rate=32000,\n", " segment_duration=1.0)" ] }, { "cell_type": "code", "execution_count": 17, "id": "008d2d8a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'path': '/data/hear-2021.0.6/tasks/speech_commands-v0.0.2-5h/48000/train/_silence__doing_the_dishes-1048000.wav', 'y_true': 0, 'class_name': '_silence_', 'segment_idx': 0, 'feature': array([ 0.01144081, 0.00943983, 0.00135719, ..., -0.01853629,\n", " -0.0183027 , -0.0120908 ], shape=(32000,), dtype=float32)}\n" ] } ], "source": [ "# Check the first entry in the training dataset\n", "print(train_dset[0])" ] }, { "cell_type": "code", "execution_count": 18, "id": "9b04bba7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of training samples: 16000\n", "Number of validation samples: 2000\n", "Number of test samples: 4890\n" ] } ], "source": [ "# Check the lengths of the 
datasets\n", "print(f\"Number of training samples: {len(train_dset)}\")\n", "print(f\"Number of validation samples: {len(valid_dset)}\")\n", "print(f\"Number of test samples: {len(test_dset)}\")" ] }, { "cell_type": "markdown", "id": "f1ca84a4", "metadata": {}, "source": [ "## 3. Initializing the AudioClassifier\n", "\n", "The rest is straightforward. The steps are Classifier Initialization -> Trainer -> Evaluator. We instantiate a simple audio classifier with MobileNet as the backbone feature extractor, a lightweight CNN-based architecture that trains quickly. Since the backbone is lightweight, we leave it unfrozen (`freeze_backbone=False`) and update all of its weights during training." ] }, { "cell_type": "code", "execution_count": 20, "id": "78263c3a", "metadata": {}, "outputs": [], "source": [ "from deepaudiox import AudioClassifier\n", "\n", "model = AudioClassifier(backbone=\"mobilenet_10_as\",\n", "                        num_classes=len(class_mapping),\n", "                        freeze_backbone=False,\n", "                        pretrained=True,\n", "                        sample_rate=32_000)" ] }, { "cell_type": "markdown", "id": "56281d84", "metadata": {}, "source": [ "To see all the backbones available in the library, use the `AVAILABLE_BACKBONES` variable." 
] }, { "cell_type": "code", "execution_count": 21, "id": "5faa9122", "metadata": {}, "outputs": [], "source": [ "from deepaudiox import AVAILABLE_BACKBONES" ] }, { "cell_type": "code", "execution_count": 24, "id": "e7f874b8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AudioClassifierConstructor(\n", " (backbone_constructor): BackboneConstructor(\n", " (backbone): MobileNet(\n", " (features): Sequential(\n", " (0): Conv2dNormActivation(\n", " (0): Conv2d(1, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)\n", " (1): BatchNorm2d(16, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): Hardswish()\n", " )\n", " (1): InvertedResidual(\n", " (block): Sequential(\n", " (0): Conv2dNormActivation(\n", " (0): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=16, bias=False)\n", " (1): BatchNorm2d(16, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): ReLU(inplace=True)\n", " )\n", " (1): Conv2dNormActivation(\n", " (0): Conv2d(16, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(16, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " )\n", " )\n", " )\n", " (2): InvertedResidual(\n", " (block): Sequential(\n", " (0): Conv2dNormActivation(\n", " (0): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): ReLU(inplace=True)\n", " )\n", " (1): Conv2dNormActivation(\n", " (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=64, bias=False)\n", " (1): BatchNorm2d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): ReLU(inplace=True)\n", " )\n", " (2): Conv2dNormActivation(\n", " (0): Conv2d(64, 24, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(24, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " )\n", " )\n", " )\n", " (3): 
InvertedResidual(\n", " (block): Sequential(\n", " (0): Conv2dNormActivation(\n", " (0): Conv2d(24, 72, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(72, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): ReLU(inplace=True)\n", " )\n", " (1): Conv2dNormActivation(\n", " (0): Conv2d(72, 72, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=72, bias=False)\n", " (1): BatchNorm2d(72, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): ReLU(inplace=True)\n", " )\n", " (2): Conv2dNormActivation(\n", " (0): Conv2d(72, 24, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(24, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " )\n", " )\n", " )\n", " (4): InvertedResidual(\n", " (block): Sequential(\n", " (0): Conv2dNormActivation(\n", " (0): Conv2d(24, 72, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(72, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): ReLU(inplace=True)\n", " )\n", " (1): Conv2dNormActivation(\n", " (0): Conv2d(72, 72, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), groups=72, bias=False)\n", " (1): BatchNorm2d(72, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): ReLU(inplace=True)\n", " )\n", " (2): ConcurrentSEBlock(\n", " (conc_se_layers): ModuleList(\n", " (0): SqueezeExcitation(\n", " (fc1): Linear(in_features=72, out_features=24, bias=True)\n", " (fc2): Linear(in_features=24, out_features=72, bias=True)\n", " (activation): ReLU()\n", " (scale_activation): Sigmoid()\n", " )\n", " )\n", " )\n", " (3): Conv2dNormActivation(\n", " (0): Conv2d(72, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(40, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " )\n", " )\n", " )\n", " (5): InvertedResidual(\n", " (block): Sequential(\n", " (0): Conv2dNormActivation(\n", " (0): Conv2d(40, 120, kernel_size=(1, 1), 
stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(120, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): ReLU(inplace=True)\n", " )\n", " (1): Conv2dNormActivation(\n", " (0): Conv2d(120, 120, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=120, bias=False)\n", " (1): BatchNorm2d(120, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): ReLU(inplace=True)\n", " )\n", " (2): ConcurrentSEBlock(\n", " (conc_se_layers): ModuleList(\n", " (0): SqueezeExcitation(\n", " (fc1): Linear(in_features=120, out_features=32, bias=True)\n", " (fc2): Linear(in_features=32, out_features=120, bias=True)\n", " (activation): ReLU()\n", " (scale_activation): Sigmoid()\n", " )\n", " )\n", " )\n", " (3): Conv2dNormActivation(\n", " (0): Conv2d(120, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(40, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " )\n", " )\n", " )\n", " (6): InvertedResidual(\n", " (block): Sequential(\n", " (0): Conv2dNormActivation(\n", " (0): Conv2d(40, 120, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(120, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): ReLU(inplace=True)\n", " )\n", " (1): Conv2dNormActivation(\n", " (0): Conv2d(120, 120, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=120, bias=False)\n", " (1): BatchNorm2d(120, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): ReLU(inplace=True)\n", " )\n", " (2): ConcurrentSEBlock(\n", " (conc_se_layers): ModuleList(\n", " (0): SqueezeExcitation(\n", " (fc1): Linear(in_features=120, out_features=32, bias=True)\n", " (fc2): Linear(in_features=32, out_features=120, bias=True)\n", " (activation): ReLU()\n", " (scale_activation): Sigmoid()\n", " )\n", " )\n", " )\n", " (3): Conv2dNormActivation(\n", " (0): Conv2d(120, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(40, eps=0.001, momentum=0.01, 
affine=True, track_running_stats=True)\n", " )\n", " )\n", " )\n", " (7): InvertedResidual(\n", " (block): Sequential(\n", " (0): Conv2dNormActivation(\n", " (0): Conv2d(40, 240, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(240, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): Hardswish()\n", " )\n", " (1): Conv2dNormActivation(\n", " (0): Conv2d(240, 240, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=240, bias=False)\n", " (1): BatchNorm2d(240, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): Hardswish()\n", " )\n", " (2): Conv2dNormActivation(\n", " (0): Conv2d(240, 80, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(80, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " )\n", " )\n", " )\n", " (8): InvertedResidual(\n", " (block): Sequential(\n", " (0): Conv2dNormActivation(\n", " (0): Conv2d(80, 200, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(200, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): Hardswish()\n", " )\n", " (1): Conv2dNormActivation(\n", " (0): Conv2d(200, 200, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=200, bias=False)\n", " (1): BatchNorm2d(200, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): Hardswish()\n", " )\n", " (2): Conv2dNormActivation(\n", " (0): Conv2d(200, 80, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(80, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " )\n", " )\n", " )\n", " (9): InvertedResidual(\n", " (block): Sequential(\n", " (0): Conv2dNormActivation(\n", " (0): Conv2d(80, 184, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(184, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): Hardswish()\n", " )\n", " (1): Conv2dNormActivation(\n", " (0): Conv2d(184, 184, kernel_size=(3, 3), stride=(1, 1), padding=(1, 
1), groups=184, bias=False)\n", " (1): BatchNorm2d(184, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): Hardswish()\n", " )\n", " (2): Conv2dNormActivation(\n", " (0): Conv2d(184, 80, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(80, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " )\n", " )\n", " )\n", " (10): InvertedResidual(\n", " (block): Sequential(\n", " (0): Conv2dNormActivation(\n", " (0): Conv2d(80, 184, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(184, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): Hardswish()\n", " )\n", " (1): Conv2dNormActivation(\n", " (0): Conv2d(184, 184, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=184, bias=False)\n", " (1): BatchNorm2d(184, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): Hardswish()\n", " )\n", " (2): Conv2dNormActivation(\n", " (0): Conv2d(184, 80, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(80, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " )\n", " )\n", " )\n", " (11): InvertedResidual(\n", " (block): Sequential(\n", " (0): Conv2dNormActivation(\n", " (0): Conv2d(80, 480, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(480, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): Hardswish()\n", " )\n", " (1): Conv2dNormActivation(\n", " (0): Conv2d(480, 480, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=480, bias=False)\n", " (1): BatchNorm2d(480, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): Hardswish()\n", " )\n", " (2): ConcurrentSEBlock(\n", " (conc_se_layers): ModuleList(\n", " (0): SqueezeExcitation(\n", " (fc1): Linear(in_features=480, out_features=120, bias=True)\n", " (fc2): Linear(in_features=120, out_features=480, bias=True)\n", " (activation): ReLU()\n", " (scale_activation): Sigmoid()\n", " )\n", " 
)\n", " )\n", " (3): Conv2dNormActivation(\n", " (0): Conv2d(480, 112, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(112, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " )\n", " )\n", " )\n", " (12): InvertedResidual(\n", " (block): Sequential(\n", " (0): Conv2dNormActivation(\n", " (0): Conv2d(112, 672, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(672, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): Hardswish()\n", " )\n", " (1): Conv2dNormActivation(\n", " (0): Conv2d(672, 672, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=672, bias=False)\n", " (1): BatchNorm2d(672, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): Hardswish()\n", " )\n", " (2): ConcurrentSEBlock(\n", " (conc_se_layers): ModuleList(\n", " (0): SqueezeExcitation(\n", " (fc1): Linear(in_features=672, out_features=168, bias=True)\n", " (fc2): Linear(in_features=168, out_features=672, bias=True)\n", " (activation): ReLU()\n", " (scale_activation): Sigmoid()\n", " )\n", " )\n", " )\n", " (3): Conv2dNormActivation(\n", " (0): Conv2d(672, 112, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(112, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " )\n", " )\n", " )\n", " (13): InvertedResidual(\n", " (block): Sequential(\n", " (0): Conv2dNormActivation(\n", " (0): Conv2d(112, 672, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(672, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): Hardswish()\n", " )\n", " (1): Conv2dNormActivation(\n", " (0): Conv2d(672, 672, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), groups=672, bias=False)\n", " (1): BatchNorm2d(672, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): Hardswish()\n", " )\n", " (2): ConcurrentSEBlock(\n", " (conc_se_layers): ModuleList(\n", " (0): SqueezeExcitation(\n", " (fc1): 
Linear(in_features=672, out_features=168, bias=True)\n", " (fc2): Linear(in_features=168, out_features=672, bias=True)\n", " (activation): ReLU()\n", " (scale_activation): Sigmoid()\n", " )\n", " )\n", " )\n", " (3): Conv2dNormActivation(\n", " (0): Conv2d(672, 160, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(160, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " )\n", " )\n", " )\n", " (14): InvertedResidual(\n", " (block): Sequential(\n", " (0): Conv2dNormActivation(\n", " (0): Conv2d(160, 960, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(960, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): Hardswish()\n", " )\n", " (1): Conv2dNormActivation(\n", " (0): Conv2d(960, 960, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=960, bias=False)\n", " (1): BatchNorm2d(960, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): Hardswish()\n", " )\n", " (2): ConcurrentSEBlock(\n", " (conc_se_layers): ModuleList(\n", " (0): SqueezeExcitation(\n", " (fc1): Linear(in_features=960, out_features=240, bias=True)\n", " (fc2): Linear(in_features=240, out_features=960, bias=True)\n", " (activation): ReLU()\n", " (scale_activation): Sigmoid()\n", " )\n", " )\n", " )\n", " (3): Conv2dNormActivation(\n", " (0): Conv2d(960, 160, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(160, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " )\n", " )\n", " )\n", " (15): InvertedResidual(\n", " (block): Sequential(\n", " (0): Conv2dNormActivation(\n", " (0): Conv2d(160, 960, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(960, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): Hardswish()\n", " )\n", " (1): Conv2dNormActivation(\n", " (0): Conv2d(960, 960, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=960, bias=False)\n", " (1): BatchNorm2d(960, eps=0.001, momentum=0.01, 
affine=True, track_running_stats=True)\n", " (2): Hardswish()\n", " )\n", " (2): ConcurrentSEBlock(\n", " (conc_se_layers): ModuleList(\n", " (0): SqueezeExcitation(\n", " (fc1): Linear(in_features=960, out_features=240, bias=True)\n", " (fc2): Linear(in_features=240, out_features=960, bias=True)\n", " (activation): ReLU()\n", " (scale_activation): Sigmoid()\n", " )\n", " )\n", " )\n", " (3): Conv2dNormActivation(\n", " (0): Conv2d(960, 160, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(160, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " )\n", " )\n", " )\n", " (16): Conv2dNormActivation(\n", " (0): Conv2d(160, 960, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", " (1): BatchNorm2d(960, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)\n", " (2): Hardswish()\n", " )\n", " )\n", " (feature_extractor): AugmentMelSTFT(\n", " (freqm): FrequencyMasking()\n", " (timem): TimeMasking()\n", " )\n", " )\n", " (pooling): GAP()\n", " )\n", " (classifier): MLPHead(\n", " (model): Sequential(\n", " (0): Linear(in_features=960, out_features=12, bias=True)\n", " )\n", " )\n", ")" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Model Inspection\n", "model" ] }, { "cell_type": "markdown", "id": "05032396", "metadata": {}, "source": [ "## 4. Training\n", "\n", "We are now ready to train our model for speech command classification. Note that in this case the dataset comes with a predetermined validation split, which we can use during training." ] }, { "cell_type": "code", "execution_count": 25, "id": "569f9461", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Epoch 1/50]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Using GPU: NVIDIA GeForce RTX 4090\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Epoch 1 | Train Loss: 1.5644 | Val. 
Loss: 1.5606 | Time: 3.32s \n", "[CHECKPOINTER] Validation loss decreased: (inf --> 1.560594), \u001b[92m(-nan%)\u001b[0m.\n", "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n", "[Epoch 2/50]\n", "Epoch 2 | Train Loss: 1.4823 | Val. Loss: 0.9147 | Time: 2.62s \n", "[CHECKPOINTER] Validation loss decreased: (1.560594 --> 0.914667), \u001b[92m(-41.39%)\u001b[0m.\n", "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n", "[Epoch 3/50]\n", "Epoch 3 | Train Loss: 1.3784 | Val. Loss: 1.4767 | Time: 2.66s \n", "[Epoch 4/50]\n", "Epoch 4 | Train Loss: 1.3140 | Val. Loss: 0.4436 | Time: 2.60s \n", "[CHECKPOINTER] Validation loss decreased: (0.914667 --> 0.443601), \u001b[92m(-51.50%)\u001b[0m.\n", "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n", "[Epoch 5/50]\n", "Epoch 5 | Train Loss: 1.3024 | Val. Loss: 0.3455 | Time: 2.65s \n", "[CHECKPOINTER] Validation loss decreased: (0.443601 --> 0.345517), \u001b[92m(-22.11%)\u001b[0m.\n", "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n", "[Epoch 6/50]\n", "Epoch 6 | Train Loss: 1.2830 | Val. Loss: 0.4031 | Time: 2.65s \n", "[Epoch 7/50]\n", "Epoch 7 | Train Loss: 1.2611 | Val. Loss: 0.2845 | Time: 2.62s \n", "[CHECKPOINTER] Validation loss decreased: (0.345517 --> 0.284467), \u001b[92m(-17.67%)\u001b[0m.\n", "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n", "[Epoch 8/50]\n", "Epoch 8 | Train Loss: 1.2377 | Val. Loss: 0.2525 | Time: 2.67s \n", "[CHECKPOINTER] Validation loss decreased: (0.284467 --> 0.252507), \u001b[92m(-11.23%)\u001b[0m.\n", "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n", "[Epoch 9/50]\n", "Epoch 9 | Train Loss: 1.2382 | Val. Loss: 0.2880 | Time: 2.68s \n", "[Epoch 10/50]\n", "Epoch 10 | Train Loss: 1.2242 | Val. Loss: 0.2928 | Time: 2.65s \n", "[Epoch 11/50]\n", "Epoch 11 | Train Loss: 1.2149 | Val. 
Loss: 0.2199 | Time: 2.61s \n", "[CHECKPOINTER] Validation loss decreased: (0.252507 --> 0.219867), \u001b[92m(-12.93%)\u001b[0m.\n", "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n", "[Epoch 12/50]\n", "Epoch 12 | Train Loss: 1.2129 | Val. Loss: 0.2184 | Time: 2.68s \n", "[CHECKPOINTER] Validation loss decreased: (0.219867 --> 0.218392), \u001b[92m(-0.67%)\u001b[0m.\n", "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n", "[Epoch 13/50]\n", "Epoch 13 | Train Loss: 1.2014 | Val. Loss: 0.1860 | Time: 2.70s \n", "[CHECKPOINTER] Validation loss decreased: (0.218392 --> 0.185957), \u001b[92m(-14.85%)\u001b[0m.\n", "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n", "[Epoch 14/50]\n", "Epoch 14 | Train Loss: 1.2113 | Val. Loss: 0.1765 | Time: 2.67s \n", "[CHECKPOINTER] Validation loss decreased: (0.185957 --> 0.176475), \u001b[92m(-5.10%)\u001b[0m.\n", "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n", "[Epoch 15/50]\n", "Epoch 15 | Train Loss: 1.1968 | Val. Loss: 0.1685 | Time: 2.74s \n", "[CHECKPOINTER] Validation loss decreased: (0.176475 --> 0.168494), \u001b[92m(-4.52%)\u001b[0m.\n", "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n", "[Epoch 16/50]\n", "Epoch 16 | Train Loss: 1.2047 | Val. Loss: 0.2046 | Time: 2.72s \n", "[Epoch 17/50]\n", "Epoch 17 | Train Loss: 1.1981 | Val. Loss: 0.1594 | Time: 2.68s \n", "[CHECKPOINTER] Validation loss decreased: (0.168494 --> 0.159420), \u001b[92m(-5.39%)\u001b[0m.\n", "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n", "[Epoch 18/50]\n", "Epoch 18 | Train Loss: 1.1948 | Val. Loss: 0.1701 | Time: 2.69s \n", "[Epoch 19/50]\n", "Epoch 19 | Train Loss: 1.1924 | Val. Loss: 0.1541 | Time: 2.63s \n", "[CHECKPOINTER] Validation loss decreased: (0.159420 --> 0.154148), \u001b[92m(-3.31%)\u001b[0m.\n", "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n", "[Epoch 20/50]\n", "Epoch 20 | Train Loss: 1.1851 | Val. 
Loss: 0.1478 | Time: 2.68s \n", "[CHECKPOINTER] Validation loss decreased: (0.154148 --> 0.147844), \u001b[92m(-4.09%)\u001b[0m.\n", "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n", "[Epoch 21/50]\n", "Epoch 21 | Train Loss: 1.1814 | Val. Loss: 0.1905 | Time: 2.67s \n", "[Epoch 22/50]\n", "Epoch 22 | Train Loss: 1.1673 | Val. Loss: 0.1482 | Time: 2.60s \n", "[Epoch 23/50]\n", "Epoch 23 | Train Loss: 1.1719 | Val. Loss: 0.1611 | Time: 2.66s \n", "[Epoch 24/50]\n", "Epoch 24 | Train Loss: 1.1771 | Val. Loss: 0.1800 | Time: 2.65s \n", "[Epoch 25/50]\n", "Epoch 25 | Train Loss: 1.1650 | Val. Loss: 0.1583 | Time: 2.67s \n", "[Epoch 26/50]\n", "Epoch 26 | Train Loss: 1.1611 | Val. Loss: 0.1416 | Time: 2.63s \n", "[CHECKPOINTER] Validation loss decreased: (0.147844 --> 0.141557), \u001b[92m(-4.25%)\u001b[0m.\n", "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n", "[Epoch 27/50]\n", "Epoch 27 | Train Loss: 1.1593 | Val. Loss: 0.1402 | Time: 2.68s \n", "[CHECKPOINTER] Validation loss decreased: (0.141557 --> 0.140151), \u001b[92m(-0.99%)\u001b[0m.\n", "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n", "[Epoch 28/50]\n", "Epoch 28 | Train Loss: 1.1600 | Val. Loss: 0.1407 | Time: 2.70s \n", "[Epoch 29/50]\n", "Epoch 29 | Train Loss: 1.1700 | Val. Loss: 0.1624 | Time: 2.67s \n", "[Epoch 30/50]\n", "Epoch 30 | Train Loss: 1.1484 | Val. Loss: 0.1412 | Time: 2.63s \n", "[Epoch 31/50]\n", "Epoch 31 | Train Loss: 1.1673 | Val. Loss: 0.1415 | Time: 2.63s \n", "[Epoch 32/50]\n", "Epoch 32 | Train Loss: 1.1712 | Val. Loss: 0.1409 | Time: 2.63s \n", "[Epoch 33/50]\n", "Epoch 33 | Train Loss: 1.1398 | Val. Loss: 0.1438 | Time: 2.68s \n", "[Epoch 34/50]\n", "Epoch 34 | Train Loss: 1.1542 | Val. 
Loss: 0.1260 | Time: 2.64s \n", "[CHECKPOINTER] Validation loss decreased: (0.140151 --> 0.126029), \u001b[92m(-10.08%)\u001b[0m.\n", "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n", "[Epoch 35/50]\n", "Epoch 35 | Train Loss: 1.1653 | Val. Loss: 0.1367 | Time: 2.70s \n", "[Epoch 36/50]\n", "Epoch 36 | Train Loss: 1.1419 | Val. Loss: 0.1463 | Time: 2.66s \n", "[Epoch 37/50]\n", "Epoch 37 | Train Loss: 1.1602 | Val. Loss: 0.1179 | Time: 2.64s \n", "[CHECKPOINTER] Validation loss decreased: (0.126029 --> 0.117883), \u001b[92m(-6.46%)\u001b[0m.\n", "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n", "[Epoch 38/50]\n", "Epoch 38 | Train Loss: 1.1494 | Val. Loss: 0.1095 | Time: 2.69s \n", "[CHECKPOINTER] Validation loss decreased: (0.117883 --> 0.109453), \u001b[92m(-7.15%)\u001b[0m.\n", "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n", "[Epoch 39/50]\n", "Epoch 39 | Train Loss: 1.1425 | Val. Loss: 0.1517 | Time: 2.70s \n", "[Epoch 40/50]\n", "Epoch 40 | Train Loss: 1.1475 | Val. Loss: 0.1156 | Time: 2.64s \n", "[Epoch 41/50]\n", "Epoch 41 | Train Loss: 1.1372 | Val. Loss: 0.2005 | Time: 2.67s \n", "[Epoch 42/50]\n", "Epoch 42 | Train Loss: 1.1491 | Val. Loss: 0.1219 | Time: 2.64s \n", "[Epoch 43/50]\n", "Epoch 43 | Train Loss: 1.1490 | Val. Loss: 0.1205 | Time: 2.67s \n", "[Epoch 44/50]\n", "Epoch 44 | Train Loss: 1.1481 | Val. Loss: 0.1335 | Time: 2.68s \n", "[Epoch 45/50]\n", "Epoch 45 | Train Loss: 1.1458 | Val. Loss: 0.1184 | Time: 2.66s \n", "[Epoch 46/50]\n", "Epoch 46 | Train Loss: 1.1441 | Val. Loss: 0.1129 | Time: 2.68s \n", "[EARLY STOPPING] Elapsed epochs: 8 out of 10\n", "[Epoch 47/50]\n", "Epoch 47 | Train Loss: 1.1367 | Val. Loss: 0.1086 | Time: 2.66s \n", "[CHECKPOINTER] Validation loss decreased: (0.109453 --> 0.108631), \u001b[92m(-0.75%)\u001b[0m.\n", "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n", "[Epoch 48/50]\n", "Epoch 48 | Train Loss: 1.1329 | Val. 
Loss: 0.1046 | Time: 2.69s \n", "[CHECKPOINTER] Validation loss decreased: (0.108631 --> 0.104579), \u001b[92m(-3.73%)\u001b[0m.\n", "[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt\n", "[Epoch 49/50]\n", "Epoch 49 | Train Loss: 1.1461 | Val. Loss: 0.1272 | Time: 2.71s \n", "[Epoch 50/50]\n", "Epoch 50 | Train Loss: 1.1353 | Val. Loss: 0.1148 | Time: 2.65s \n", "Training has finished.\n" ] } ], "source": [ "from deepaudiox import Trainer\n", "\n", "trainer = Trainer(model=model,\n", " train_dset=train_dset,\n", " validation_dset=valid_dset,\n", " epochs=50,\n", " batch_size=128,\n", " patience=10)\n", "\n", "trainer.train()" ] }, { "cell_type": "markdown", "id": "106717f5", "metadata": {}, "source": [ "## 5. Evaluation\n", "\n", "As in the first tutorial, we use the `Evaluator` to measure performance on the held-out test set." ] }, { "cell_type": "code", "execution_count": 26, "id": "41e53110", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using GPU: NVIDIA GeForce RTX 4090\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Testing has finished. 
\n", "[REPORTER] Class mapping: {'_silence_': 0, '_unknown_': 1, 'down': 2, 'go': 3, 'left': 4, 'no': 5, 'off': 6, 'on': 7, 'right': 8, 'stop': 9, 'up': 10, 'yes': 11} \n", "\n", "[REPORTER] Classification Report: \n", "\n", " precision recall f1-score support\n", "\n", " _silence_ 1.00 0.94 0.97 408\n", " _unknown_ 0.60 0.99 0.75 408\n", " down 0.96 0.85 0.90 406\n", " go 0.98 0.80 0.88 402\n", " left 0.98 0.92 0.95 412\n", " no 0.91 0.93 0.92 405\n", " off 0.97 0.92 0.95 402\n", " on 0.99 0.87 0.93 396\n", " right 1.00 0.93 0.96 396\n", " stop 1.00 0.99 0.99 411\n", " up 0.97 0.96 0.97 425\n", " yes 0.99 0.98 0.99 419\n", "\n", " accuracy 0.92 4890\n", " macro avg 0.95 0.92 0.93 4890\n", "weighted avg 0.95 0.92 0.93 4890\n", "\n", "[REPORTER] Confusion Matrix: \n", "\n", "[[385 23 0 0 0 0 0 0 0 0 0 0]\n", " [ 0 405 0 1 0 2 0 0 0 0 0 0]\n", " [ 0 42 347 1 1 15 0 0 0 0 0 0]\n", " [ 0 44 13 320 2 21 0 0 0 0 2 0]\n", " [ 0 29 0 0 379 0 0 0 0 0 1 3]\n", " [ 0 21 3 3 0 378 0 0 0 0 0 0]\n", " [ 0 20 0 0 0 0 371 3 0 1 7 0]\n", " [ 0 42 0 0 0 0 8 345 0 0 1 0]\n", " [ 0 27 0 0 2 0 0 0 367 0 0 0]\n", " [ 0 4 0 0 0 0 0 0 0 407 0 0]\n", " [ 0 15 0 0 0 0 3 0 0 0 407 0]\n", " [ 0 5 0 0 1 0 1 0 0 0 0 412]]\n", "[REPORTER] Average Posteriors: \n", "\n", "_silence_ : 0.987\n", "_unknown_ : 0.989\n", "down : 0.967\n", "go : 0.926\n", "left : 0.977\n", "no : 0.951\n", "off : 0.978\n", "on : 0.961\n", "right : 0.975\n", "stop : 0.997\n", "up : 0.982\n", "yes : 0.994\n" ] } ], "source": [ "from deepaudiox import Evaluator\n", "\n", "# First load the best model checkpoint\n", "model = AudioClassifier.from_checkpoint(\"checkpoint.pt\")\n", "\n", "evaluator = Evaluator(model=model, test_dset=test_dset, class_mapping=class_mapping)\n", "\n", "evaluator.evaluate() " ] }, { "cell_type": "code", "execution_count": null, "id": "f29130ab", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "deepaudio-x (3.13.9)", "language": "python", "name": 
"python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.9" } }, "nbformat": 4, "nbformat_minor": 5 }